BankChurners.csv - raw dataset of the project
- CLIENTNUM: Client number. Unique identifier for the customer holding the account
- Attrition_Flag: Internal event (customer activity) variable - "Attrited Customer" if the account is closed, else "Existing Customer"
- Customer_Age: Age in years
- Gender: Gender of the account holder
- Dependent_count: Number of dependents
- Education_Level: Educational qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate
- Marital_Status: Marital status of the account holder
- Income_Category: Annual income category of the account holder
- Card_Category: Type of card
- Months_on_book: Period of relationship with the bank
- Total_Relationship_Count: Total no. of products held by the customer
- Months_Inactive_12_mon: No. of months inactive in the last 12 months
- Contacts_Count_12_mon: No. of contacts between the customer and the bank in the last 12 months
- Credit_Limit: Credit limit on the credit card
- Total_Revolving_Bal: The balance that carries over from one month to the next (revolving balance)
- Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (average of the last 12 months)
- Total_Trans_Amt: Total transaction amount (last 12 months)
- Total_Trans_Ct: Total transaction count (last 12 months)
- Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in the 4th quarter to the total transaction count in the 1st quarter
- Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in the 4th quarter to the total transaction amount in the 1st quarter
- Avg_Utilization_Ratio: Represents how much of the available credit the customer spent

# nb_black automatically keeps the Python code in each cell formatted
%load_ext nb_black
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to split data into training and testing sets and to cross-validate
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Resize the picture
plt.rc("figure", figsize=[10, 6])
# Remove the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Set the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# ----- Packages to build the models ------
# Library to get different metric scores
from sklearn import metrics
# To impute missing values
from sklearn.impute import KNNImputer
# To build a logistic regression model
from sklearn.linear_model import LogisticRegression
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# Library to tune model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# Library for Bagging classifier
from sklearn.ensemble import BaggingClassifier
# Library for Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
# Library for Decision Tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# Libraries for AdaBoost, GradientBoost, Stacking
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.ensemble import StackingClassifier
# Library for XGBoost classifier
from xgboost import XGBClassifier
# To impute missing values
from sklearn.impute import SimpleImputer
# To do one-hot encoding
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# To be used for creating pipelines, make_pipeline and personalizing them
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
credit_card = pd.read_csv("BankChurners.csv")
# copy the data to another variable to keep the original data
data = credit_card.copy()
data.head()
|   | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
data.shape
(10127, 21)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CLIENTNUM                 10127 non-null  int64
 1   Attrition_Flag            10127 non-null  object
 2   Customer_Age              10127 non-null  int64
 3   Gender                    10127 non-null  object
 4   Dependent_count           10127 non-null  int64
 5   Education_Level           8608 non-null   object
 6   Marital_Status            9378 non-null   object
 7   Income_Category           10127 non-null  object
 8   Card_Category             10127 non-null  object
 9   Months_on_book            10127 non-null  int64
 10  Total_Relationship_Count  10127 non-null  int64
 11  Months_Inactive_12_mon    10127 non-null  int64
 12  Contacts_Count_12_mon     10127 non-null  int64
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64
 18  Total_Trans_Ct            10127 non-null  int64
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
# Drop the ID column
data.drop(["CLIENTNUM"], axis=1, inplace=True)
data.head()
|   | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.0 | 777 | 11914.0 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.0 | 864 | 7392.0 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.0 | 0 | 3418.0 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.0 | 2517 | 796.0 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.0 | 0 | 4716.0 | 2.175 | 816 | 28 | 2.500 | 0.000 |
# Get all the columns that are object
object_columns = data.select_dtypes("object").columns
object_columns
Index(['Attrition_Flag', 'Gender', 'Education_Level', 'Marital_Status',
'Income_Category', 'Card_Category'],
dtype='object')
# Convert object to categorical type
for col in object_columns:
    data[col] = data[col].astype("category")
# Check the data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Attrition_Flag            10127 non-null  category
 1   Customer_Age              10127 non-null  int64
 2   Gender                    10127 non-null  category
 3   Dependent_count           10127 non-null  int64
 4   Education_Level           8608 non-null   category
 5   Marital_Status            9378 non-null   category
 6   Income_Category           10127 non-null  category
 7   Card_Category             10127 non-null  category
 8   Months_on_book            10127 non-null  int64
 9   Total_Relationship_Count  10127 non-null  int64
 10  Months_Inactive_12_mon    10127 non-null  int64
 11  Contacts_Count_12_mon     10127 non-null  int64
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64
 14  Avg_Open_To_Buy           10127 non-null  float64
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 16  Total_Trans_Amt           10127 non-null  int64
 17  Total_Trans_Ct            10127 non-null  int64
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 19  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: category(6), float64(5), int64(9)
memory usage: 1.1 MB
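The memory usage dropped from 1.6+ MB to 1.1 MB after the conversion. A minimal sketch of why the category dtype is cheaper, using a toy series (the labels mimic `Card_Category`, but the counts are illustrative only): category storage keeps one copy of each label plus small integer codes per row, instead of one Python string object per row.

```python
import pandas as pd

# Toy series with few distinct labels, like Card_Category
s_obj = pd.Series(["Blue"] * 9000 + ["Silver"] * 1000)  # object dtype
s_cat = s_obj.astype("category")                        # category dtype

# Category dtype stores integer codes + one copy of each label
print(s_obj.memory_usage(deep=True) > s_cat.memory_usage(deep=True))  # True
```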
data.isnull().sum()
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64
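Only `Education_Level` and `Marital_Status` have missing values. One simple option (besides the `SimpleImputer`/`KNNImputer` imported above) is mode imputation with plain pandas; a sketch on a toy frame, where the column names reuse the dataset's but the values are made up:

```python
import pandas as pd

# Toy frame mimicking the two columns with missing values (illustrative data only)
toy = pd.DataFrame(
    {
        "Education_Level": ["Graduate", None, "High School", "Graduate"],
        "Marital_Status": ["Married", "Single", None, "Married"],
    }
)

# Fill each column's NaNs with that column's most frequent value (the mode)
for col in ["Education_Level", "Marital_Status"]:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy.isnull().sum().sum())  # 0
```

Note that for a real train/test split, the mode should be computed on the training data only to avoid leakage.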
data.duplicated().sum()
0
data.describe().T
|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Customer_Age | 10127.0 | 46.325960 | 8.016814 | 26.0 | 41.000 | 46.000 | 52.000 | 73.000 |
| Dependent_count | 10127.0 | 2.346203 | 1.298908 | 0.0 | 1.000 | 2.000 | 3.000 | 5.000 |
| Months_on_book | 10127.0 | 35.928409 | 7.986416 | 13.0 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.0 | 3.812580 | 1.554408 | 1.0 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.0 | 2.341167 | 1.010622 | 0.0 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.0 | 2.455317 | 1.106225 | 0.0 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.0 | 8631.953698 | 9088.776650 | 1438.3 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.0 | 1162.814061 | 814.987335 | 0.0 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.0 | 7469.139637 | 9090.685324 | 3.0 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.0 | 0.759941 | 0.219207 | 0.0 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.0 | 4404.086304 | 3397.129254 | 510.0 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.0 | 64.858695 | 23.472570 | 10.0 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.0 | 0.712222 | 0.238086 | 0.0 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.0 | 0.274894 | 0.275691 | 0.0 | 0.023 | 0.176 | 0.503 | 0.999 |
data.describe(exclude=np.number).T
|   | count | unique | top | freq |
|---|---|---|---|---|
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 |
| Gender | 10127 | 2 | F | 5358 |
| Education_Level | 8608 | 6 | Graduate | 3128 |
| Marital_Status | 9378 | 3 | Married | 4687 |
| Income_Category | 10127 | 6 | Less than $40K | 3561 |
| Card_Category | 10127 | 4 | Blue | 9436 |
# get categorical column names
categorical_columns = data.select_dtypes("category").columns
categorical_columns
Index(['Attrition_Flag', 'Gender', 'Education_Level', 'Marital_Status',
'Income_Category', 'Card_Category'],
dtype='object')
# get the unique values
for col in categorical_columns:
    print(data[col].value_counts())
    print("*" * 50)
Existing Customer    8500
Attrited Customer    1627
Name: Attrition_Flag, dtype: int64
**************************************************
F    5358
M    4769
Name: Gender, dtype: int64
**************************************************
Graduate         3128
High School      2013
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451
Name: Education_Level, dtype: int64
**************************************************
Married     4687
Single      3943
Divorced     748
Name: Marital_Status, dtype: int64
**************************************************
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: Income_Category, dtype: int64
**************************************************
Blue        9436
Silver       555
Gold         116
Platinum      20
Name: Card_Category, dtype: int64
**************************************************
data["Income_Category"].value_counts()
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: Income_Category, dtype: int64
# Change 'abc' to 'Unknown'
data["Income_Category"].replace("abc", "Unknown", inplace=True)
# Check the data
data["Income_Category"].value_counts()
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
Unknown           1112
$120K +            727
Name: Income_Category, dtype: int64
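`Series.replace` works here, but since `Income_Category` is already a category dtype, `Series.cat.rename_categories` is the dtype-native way to relabel the placeholder. A sketch on a toy categorical series (illustrative values only):

```python
import pandas as pd

# Toy categorical series with the same 'abc' placeholder
s = pd.Series(["abc", "$40K - $60K", "abc"], dtype="category")

# Rename the placeholder category; codes are untouched, only the label changes
s = s.cat.rename_categories({"abc": "Unknown"})
print(list(s))  # ['Unknown', '$40K - $60K', 'Unknown']
```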
def generate_plot(data, feature, figsize=(10, 6), kde=True, bins=None):
    """
    Description:
        Generate both a boxplot and a histogram for a numerical variable.
    Inputs:
        data: dataframe of the dataset
        feature: dataframe column
        figsize: size of the figure (default (10, 6))
        kde: whether to show the density curve (default True)
        bins: number of bins for the histogram (default None, i.e. automatic binning)
    Output:
        Boxplot and histogram
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,
        sharex=True,
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )
    # Boxplot with the mean marked
    sns.boxplot(data=data, x=feature, ax=ax_box2, showmeans=True, color="violet")
    # Histogram; fall back to automatic binning when bins is None
    sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto")
    # Mark the mean on the histogram
    ax_hist2.axvline(data[feature].mean(), color="green", linestyle="--")
    # Mark the median on the histogram
    ax_hist2.axvline(data[feature].median(), color="black", linestyle="-")
generate_plot(data, "Customer_Age")
generate_plot(data, "Dependent_count")
generate_plot(data, "Months_on_book")
generate_plot(data, "Total_Relationship_Count")
generate_plot(data, "Months_Inactive_12_mon")
generate_plot(data, "Contacts_Count_12_mon")
generate_plot(data, "Credit_Limit")
generate_plot(data, "Total_Revolving_Bal")
generate_plot(data, "Avg_Open_To_Buy")
generate_plot(data, "Total_Amt_Chng_Q4_Q1")
generate_plot(data, "Total_Trans_Amt")
generate_plot(data, "Total_Trans_Ct")
generate_plot(data, "Total_Ct_Chng_Q4_Q1")
generate_plot(data, "Avg_Utilization_Ratio")
def count_statistic(dataframe, feature):
    """
    Description:
        Count the occurrences of each category of a variable, and the proportion each
        category represents.
    Inputs:
        dataframe - the dataset
        feature - the column name
    Output:
        Count and proportion of each category
    """
    count_values = dataframe[feature].value_counts()
    print("Counting:")
    print(count_values)
    print("\n")
    print("Population proportion:")
    print(count_values / count_values.sum())
def generate_countplot(data, feature):
    """
    Description:
        Draw a count plot for a categorical variable.
    Inputs:
        data - the dataset
        feature - the column name
    Output:
        The count plot
    """
    sns.countplot(data=data, x=feature)
count_statistic(data, "Attrition_Flag")
Counting:
Existing Customer    8500
Attrited Customer    1627
Name: Attrition_Flag, dtype: int64


Population proportion:
Existing Customer    0.83934
Attrited Customer    0.16066
Name: Attrition_Flag, dtype: float64
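The target is imbalanced: roughly 84% existing vs. 16% attrited customers. The notebook imports SMOTE and RandomUnderSampler from imblearn for this; as a minimal illustration of the undersampling idea, here is a pure-pandas sketch on toy labels with the same 84/16 split (illustrative data, not the real dataset):

```python
import pandas as pd

# Toy labels with the same 84/16 imbalance (illustrative only)
df = pd.DataFrame({"y": ["Existing"] * 84 + ["Attrited"] * 16})

# Randomly undersample each class down to the minority class count
minority_n = df["y"].value_counts().min()
balanced = df.groupby("y", group_keys=False).sample(n=minority_n, random_state=1)
print(balanced["y"].value_counts().tolist())  # [16, 16]
```

In practice, resampling should be applied only to the training split, never to the test data.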
generate_countplot(data, "Attrition_Flag")
count_statistic(data, "Gender")
Counting:
F    5358
M    4769
Name: Gender, dtype: int64


Population proportion:
F    0.529081
M    0.470919
Name: Gender, dtype: float64
generate_countplot(data, "Gender")
count_statistic(data, "Education_Level")
Counting:
Graduate         3128
High School      2013
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451
Name: Education_Level, dtype: int64


Population proportion:
Graduate         0.363383
High School      0.233852
Uneducated       0.172746
College          0.117681
Post-Graduate    0.059944
Doctorate        0.052393
Name: Education_Level, dtype: float64
generate_countplot(data, "Education_Level")
count_statistic(data, "Marital_Status")
Counting:
Married     4687
Single      3943
Divorced     748
Name: Marital_Status, dtype: int64


Population proportion:
Married     0.499787
Single      0.420452
Divorced    0.079761
Name: Marital_Status, dtype: float64
generate_countplot(data, "Marital_Status")
count_statistic(data, "Income_Category")
Counting:
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
Unknown           1112
$120K +            727
Name: Income_Category, dtype: int64


Population proportion:
Less than $40K    0.351634
$40K - $60K       0.176755
$80K - $120K      0.151575
$60K - $80K       0.138442
Unknown           0.109805
$120K +           0.071788
Name: Income_Category, dtype: float64
generate_countplot(data, "Income_Category")
count_statistic(data, "Card_Category")
Counting:
Blue        9436
Silver       555
Gold         116
Platinum      20
Name: Card_Category, dtype: int64


Population proportion:
Blue        0.931767
Silver      0.054804
Gold        0.011455
Platinum    0.001975
Name: Card_Category, dtype: float64
generate_countplot(data, "Card_Category")
sns.pairplot(data, diag_kind="kde", hue="Attrition_Flag")
<seaborn.axisgrid.PairGrid at 0x1c1a59a42b0>
# Correlation matrix of the numerical variables:
correlation = data.corr()
correlation
|   | Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Customer_Age | 1.000000 | -0.122254 | 0.788912 | -0.010931 | 0.054361 | -0.018452 | 0.002476 | 0.014780 | 0.001151 | -0.062042 | -0.046446 | -0.067097 | -0.012143 | 0.007114 |
| Dependent_count | -0.122254 | 1.000000 | -0.103062 | -0.039076 | -0.010768 | -0.040505 | 0.068065 | -0.002688 | 0.068291 | -0.035439 | 0.025046 | 0.049912 | 0.011087 | -0.037135 |
| Months_on_book | 0.788912 | -0.103062 | 1.000000 | -0.009203 | 0.074164 | -0.010774 | 0.007507 | 0.008623 | 0.006732 | -0.048959 | -0.038591 | -0.049819 | -0.014072 | -0.007541 |
| Total_Relationship_Count | -0.010931 | -0.039076 | -0.009203 | 1.000000 | -0.003675 | 0.055203 | -0.071386 | 0.013726 | -0.072601 | 0.050119 | -0.347229 | -0.241891 | 0.040831 | 0.067663 |
| Months_Inactive_12_mon | 0.054361 | -0.010768 | 0.074164 | -0.003675 | 1.000000 | 0.029493 | -0.020394 | -0.042210 | -0.016605 | -0.032247 | -0.036982 | -0.042787 | -0.038989 | -0.007503 |
| Contacts_Count_12_mon | -0.018452 | -0.040505 | -0.010774 | 0.055203 | 0.029493 | 1.000000 | 0.020817 | -0.053913 | 0.025646 | -0.024445 | -0.112774 | -0.152213 | -0.094997 | -0.055471 |
| Credit_Limit | 0.002476 | 0.068065 | 0.007507 | -0.071386 | -0.020394 | 0.020817 | 1.000000 | 0.042493 | 0.995981 | 0.012813 | 0.171730 | 0.075927 | -0.002020 | -0.482965 |
| Total_Revolving_Bal | 0.014780 | -0.002688 | 0.008623 | 0.013726 | -0.042210 | -0.053913 | 0.042493 | 1.000000 | -0.047167 | 0.058174 | 0.064370 | 0.056060 | 0.089861 | 0.624022 |
| Avg_Open_To_Buy | 0.001151 | 0.068291 | 0.006732 | -0.072601 | -0.016605 | 0.025646 | 0.995981 | -0.047167 | 1.000000 | 0.007595 | 0.165923 | 0.070885 | -0.010076 | -0.538808 |
| Total_Amt_Chng_Q4_Q1 | -0.062042 | -0.035439 | -0.048959 | 0.050119 | -0.032247 | -0.024445 | 0.012813 | 0.058174 | 0.007595 | 1.000000 | 0.039678 | 0.005469 | 0.384189 | 0.035235 |
| Total_Trans_Amt | -0.046446 | 0.025046 | -0.038591 | -0.347229 | -0.036982 | -0.112774 | 0.171730 | 0.064370 | 0.165923 | 0.039678 | 1.000000 | 0.807192 | 0.085581 | -0.083034 |
| Total_Trans_Ct | -0.067097 | 0.049912 | -0.049819 | -0.241891 | -0.042787 | -0.152213 | 0.075927 | 0.056060 | 0.070885 | 0.005469 | 0.807192 | 1.000000 | 0.112324 | 0.002838 |
| Total_Ct_Chng_Q4_Q1 | -0.012143 | 0.011087 | -0.014072 | 0.040831 | -0.038989 | -0.094997 | -0.002020 | 0.089861 | -0.010076 | 0.384189 | 0.085581 | 0.112324 | 1.000000 | 0.074143 |
| Avg_Utilization_Ratio | 0.007114 | -0.037135 | -0.007541 | 0.067663 | -0.007503 | -0.055471 | -0.482965 | 0.624022 | -0.538808 | 0.035235 | -0.083034 | 0.002838 | 0.074143 | 1.000000 |
# correlation heatmap:
plt.figure(figsize=(20, 10))
sns.heatmap(correlation, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
<AxesSubplot:>
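The heatmap shows that Credit_Limit and Avg_Open_To_Buy are almost perfectly correlated (0.996 in the matrix above), so one of them is redundant for modeling. A sketch of programmatically flagging such pairs, using a toy frame whose columns reuse the dataset's names but whose values are synthetic:

```python
import numpy as np
import pandas as pd

# Toy frame where two columns are nearly collinear, like Credit_Limit / Avg_Open_To_Buy
rng = np.random.default_rng(0)
limit = rng.uniform(1500, 35000, size=200)
toy = pd.DataFrame(
    {
        "Credit_Limit": limit,
        "Avg_Open_To_Buy": limit - rng.uniform(0, 2500, size=200),
        "Total_Trans_Ct": rng.integers(10, 140, size=200),
    }
)

# Keep only the upper triangle so each pair is reported once
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```

Pairs above the 0.9 threshold (here only Credit_Limit/Avg_Open_To_Buy) are candidates for dropping one column.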
# Create a function to do stacked plot:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(20, 6))
    # Place the legend outside the plot area
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
### Function to plot distributions
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    # Histogram of the predictor for each target class
    axs[0, 0].set_title(
        "Distribution of " + predictor + " for " + target + "=" + str(target_uniq[0])
    )
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
    )
    axs[0, 1].set_title(
        "Distribution of " + predictor + " for " + target + "=" + str(target_uniq[1])
    )
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
    )
    axs[1, 0].set_title("Boxplot w.r.t. target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
    axs[1, 1].set_title("Boxplot (without outliers) w.r.t. target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
    plt.tight_layout()
    plt.show()
stacked_barplot(data, "Customer_Age", "Attrition_Flag")
Attrition_Flag Attrited Customer Existing Customer All Customer_Age All 1627 8500 10127 43 85 388 473 48 85 387 472 44 84 416 500 46 82 408 490 45 79 407 486 49 79 416 495 47 76 403 479 41 76 303 379 50 71 381 452 54 69 238 307 40 64 297 361 42 62 364 426 53 59 328 387 52 58 318 376 51 58 340 398 55 51 228 279 39 48 285 333 38 47 256 303 56 43 219 262 59 40 117 157 37 37 223 260 57 33 190 223 58 24 133 157 36 24 197 221 35 21 163 184 33 20 107 127 34 19 127 146 32 17 89 106 61 17 76 93 62 17 76 93 30 15 55 70 31 13 78 91 60 13 114 127 65 9 92 101 63 8 57 65 29 7 49 56 26 6 72 78 64 5 38 43 27 3 29 32 28 1 28 29 66 1 1 2 68 1 1 2 67 0 4 4 70 0 1 1 73 0 1 1 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Gender", "Attrition_Flag")
Attrition_Flag  Attrited Customer  Existing Customer    All
Gender
All                          1627               8500  10127
F                             930               4428   5358
M                             697               4072   4769
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Dependent_count", "Attrition_Flag")
Attrition_Flag   Attrited Customer  Existing Customer    All
Dependent_count
All                           1627               8500  10127
3                              482               2250   2732
2                              417               2238   2655
1                              269               1569   1838
4                              260               1314   1574
0                              135                769    904
5                               64                360    424
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Education_Level", "Attrition_Flag")
Attrition_Flag   Attrited Customer  Existing Customer   All
Education_Level
All                           1371               7237  8608
Graduate                       487               2641  3128
High School                    306               1707  2013
Uneducated                     237               1250  1487
College                        154                859  1013
Doctorate                       95                356   451
Post-Graduate                   92                424   516
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Marital_Status", "Attrition_Flag")
Attrition_Flag  Attrited Customer  Existing Customer   All
Marital_Status
All                          1498               7880  9378
Married                       709               3978  4687
Single                        668               3275  3943
Divorced                      121                627   748
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Income_Category", "Attrition_Flag")
Attrition_Flag   Attrited Customer  Existing Customer    All
Income_Category
All                           1627               8500  10127
Less than $40K                 612               2949   3561
$40K - $60K                    271               1519   1790
$80K - $120K                   242               1293   1535
$60K - $80K                    189               1213   1402
Unknown                        187                925   1112
$120K +                        126                601    727
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Card_Category", "Attrition_Flag")
Attrition_Flag  Attrited Customer  Existing Customer    All
Card_Category
All                          1627               8500  10127
Blue                         1519               7917   9436
Silver                         82                473    555
Gold                           21                 95    116
Platinum                        5                 15     20
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Months_on_book", "Attrition_Flag")
Attrition_Flag Attrited Customer Existing Customer All Months_on_book All 1627 8500 10127 36 430 2033 2463 39 64 277 341 37 62 296 358 30 58 242 300 38 57 290 347 34 57 296 353 41 51 246 297 33 48 257 305 40 45 288 333 35 45 272 317 32 44 245 289 28 43 232 275 44 42 188 230 43 42 231 273 46 36 161 197 42 36 235 271 29 34 207 241 31 34 284 318 45 33 194 227 25 31 134 165 24 28 132 160 48 27 135 162 50 25 71 96 49 24 117 141 26 24 162 186 47 24 147 171 27 23 183 206 22 20 85 105 56 17 86 103 51 16 64 80 18 13 45 58 20 13 61 74 52 12 50 62 23 12 104 116 21 10 73 83 15 9 25 34 53 7 71 78 13 7 63 70 19 6 57 63 54 6 47 53 17 4 35 39 55 4 38 42 16 3 26 29 14 1 15 16 ------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Total_Relationship_Count", "Attrition_Flag")
Attrition_Flag            Attrited Customer  Existing Customer    All
Total_Relationship_Count
All                                    1627               8500  10127
3                                       400               1905   2305
2                                       346                897   1243
1                                       233                677    910
5                                       227               1664   1891
4                                       225               1687   1912
6                                       196               1670   1866
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Months_Inactive_12_mon", "Attrition_Flag")
Attrition_Flag          Attrited Customer  Existing Customer    All
Months_Inactive_12_mon
All                                  1627               8500  10127
3                                     826               3020   3846
2                                     505               2777   3282
4                                     130                305    435
1                                     100               2133   2233
5                                      32                146    178
6                                      19                105    124
0                                      15                 14     29
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Contacts_Count_12_mon", "Attrition_Flag")
Attrition_Flag         Attrited Customer  Existing Customer    All
Contacts_Count_12_mon
All                                 1627               8500  10127
3                                    681               2699   3380
2                                    403               2824   3227
4                                    315               1077   1392
1                                    108               1391   1499
5                                     59                117    176
6                                     54                  0     54
0                                      7                392    399
------------------------------------------------------------------------------------------------------------------------
stacked_barplot(data, "Credit_Limit", "Attrition_Flag")
Attrition_Flag  Attrited Customer  Existing Customer    All
Credit_Limit
All                          1627               8500  10127
1438.3                        124                383    507
34516.0                        89                419    508
9959.0                          5                 13     18
3261.0                          3                  2      5
...                           ...                ...    ...
4969.0                          0                  2      2
4964.0                          0                  1      1
4959.0                          0                  1      1
4955.0                          0                  1      1
6514.0                          0                  2      2

[6206 rows x 3 columns]
------------------------------------------------------------------------------------------------------------------------
distribution_plot_wrt_target(data, "Credit_Limit", "Attrition_Flag")
stacked_barplot(data, "Total_Revolving_Bal", "Attrition_Flag")
Attrition_Flag       Attrited Customer  Existing Customer    All
Total_Revolving_Bal
All                               1627               8500  10127
0                                  893               1577   2470
2517                               158                350    508
1381                                 3                  5      8
321                                  3                  0      3
...                                ...                ...    ...
1374                                 0                  3      3
1373                                 0                  3      3
1372                                 0                  7      7
1371                                 0                  3      3
1450                                 0                  5      5

[1975 rows x 3 columns]
------------------------------------------------------------------------------------------------------------------------
distribution_plot_wrt_target(data, "Total_Revolving_Bal", "Attrition_Flag")
stacked_barplot(data, "Avg_Open_To_Buy", "Attrition_Flag")
Attrition_Flag   Attrited Customer  Existing Customer    All
Avg_Open_To_Buy
All                           1627               8500  10127
1438.3                          96                228    324
34516.0                         39                 59     98
31999.0                         10                 16     26
1568.0                           3                  0      3
...                            ...                ...    ...
4776.0                           0                  1      1
4774.0                           0                  2      2
4772.0                           0                  1      1
745.0                            0                  3      3
5605.0                           0                  1      1

[6814 rows x 3 columns]
------------------------------------------------------------------------------------------------------------------------
distribution_plot_wrt_target(data, "Avg_Open_To_Buy", "Attrition_Flag")
stacked_barplot(data, "Total_Amt_Chng_Q4_Q1", "Attrition_Flag")
*(crosstab of Total_Amt_Chng_Q4_Q1 by Attrition_Flag: 1159 rows x 3 columns; output truncated)*
distribution_plot_wrt_target(data, "Total_Amt_Chng_Q4_Q1", "Attrition_Flag")
stacked_barplot(data, "Total_Trans_Amt", "Attrition_Flag")
*(crosstab of Total_Trans_Amt by Attrition_Flag: 5034 rows x 3 columns; output truncated)*
distribution_plot_wrt_target(data, "Total_Trans_Amt", "Attrition_Flag")
stacked_barplot(data, "Total_Trans_Ct", "Attrition_Flag")
*(crosstab of Total_Trans_Ct by Attrition_Flag: one row per distinct transaction count; omitted for brevity. Attrited customers are concentrated at low counts, roughly 35 to 50 transactions, and every customer with more than 94 transactions is an existing customer.)*
distribution_plot_wrt_target(data, "Total_Trans_Ct", "Attrition_Flag")
stacked_barplot(data, "Total_Ct_Chng_Q4_Q1", "Attrition_Flag")
*(crosstab of Total_Ct_Chng_Q4_Q1 by Attrition_Flag: 831 rows x 3 columns; output truncated)*
distribution_plot_wrt_target(data, "Total_Ct_Chng_Q4_Q1", "Attrition_Flag")
stacked_barplot(data, "Avg_Utilization_Ratio", "Attrition_Flag")
*(crosstab of Avg_Utilization_Ratio by Attrition_Flag: 965 rows x 3 columns; output truncated. The zero-utilization row mirrors the zero revolving balance row: 893 attrited, 1577 existing, 2470 total.)*
distribution_plot_wrt_target(data, "Avg_Utilization_Ratio", "Attrition_Flag")
Data Description
Univariate Data Analysis
- Customer_Age: approximately normally distributed; customers older than 68 are outliers.
- Dependent_count: no outliers.
- Months_on_book: some outliers.
- Total_Relationship_Count: no outliers; the median is larger than the mean, which indicates a left-skewed distribution.
- Months_Inactive_12_mon: some outliers.
- Contacts_Count_12_mon: some outliers.
- Credit_Limit: the mean is larger than the median, which indicates a right-skewed distribution; customers with credit limits over 24,000 dollars are outliers.
- Total_Revolving_Bal: no outliers.
- Avg_Open_To_Buy: right-skewed distribution; customers with over 23,000 dollars left to spend on their credit cards are outliers.
- Total_Amt_Chng_Q4_Q1: contains outliers, but otherwise looks normally distributed.
- Total_Trans_Amt: contains outliers; the distribution is slightly right-skewed.
- Total_Trans_Ct: a few outliers.
- Total_Ct_Chng_Q4_Q1: contains outliers; the data is normally distributed.
- Avg_Utilization_Ratio: no outliers; the mean is greater than the median, hence a right-skewed distribution.
- Attrition_Flag: about 83.93% of customers are still with the bank, while 16.07% have attrited.
- Gender: there are more females in the dataset (52.91%).
- Education_Level: most customers are graduates (36.34%); only 5.24% hold a doctorate.
- Marital_Status: about 4687 customers (49.98%) are married; only 7.98% are divorced.
- Income_Category: most customers make less than 40K dollars. There is an erroneous income value, 'abc', possibly because those customers did not want to report their income; it was recoded to Unknown.
- Card_Category: most customers (93.18%) have Blue cards; only 0.198% have Platinum cards.

Bivariate Data Analysis
- Attrition_Flag vs Customer_Age
- Attrition_Flag vs Gender
- Attrition_Flag vs Dependent_count
- Attrition_Flag vs Education_Level
- Attrition_Flag vs Marital_Status
- Attrition_Flag vs Income_Category
- Attrition_Flag vs Card_Category
- Attrition_Flag vs Months_on_book
- Attrition_Flag vs Total_Relationship_Count
- Attrition_Flag vs Months_Inactive_12_mon
- Attrition_Flag vs Contacts_Count_12_mon
- Attrition_Flag vs Credit_Limit
- Attrition_Flag vs Total_Revolving_Bal
- Attrition_Flag vs Avg_Open_To_Buy
- Attrition_Flag vs Total_Amt_Chng_Q4_Q1
- Attrition_Flag vs Total_Trans_Amt
- Attrition_Flag vs Total_Trans_Ct
- Attrition_Flag vs Total_Ct_Chng_Q4_Q1
- Attrition_Flag vs Avg_Utilization_Ratio

# Z-score function
def find_z_score(data, feature, threshold=3):
    """
    Description:
        Detect outliers in a column using the z-score method.
    Inputs:
        data - the dataset
        feature - column name
        threshold - defaults to 3 because any point more than 3 standard deviations from the mean is treated as an outlier
    Output:
        List of outlier values in the variable
    """
    outlier = []  # reset per call so counts do not accumulate across columns
    mean = np.mean(data[feature])
    std = np.std(data[feature])
    for value in data[feature]:
        z_score = (value - mean) / std
        # use the absolute z-score so outliers in both tails are caught
        if np.abs(z_score) > threshold:
            outlier.append(value)
    return outlier
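As a quick sanity check of the z-score rule, here is a self-contained toy example (the numbers are made up): a value more than three standard deviations from the mean is flagged, while typical values are not.

```python
import numpy as np

# Thirty typical observations plus one extreme value (made-up numbers)
values = np.array([10.0] * 30 + [100.0])
z_scores = np.abs((values - values.mean()) / values.std())
print(values[z_scores > 3])  # → [100.]
```

Note that with very small samples the maximum possible z-score is bounded, so a threshold of 3 only flags points when there are enough observations.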
# IQR function
def IQR_method(data, feature):
    """
    Description:
        - This function uses the interquartile range (IQR) method for outlier treatment.
        - Q1 is the 25th percentile, Q3 is the 75th percentile, and IQR = Q3 - Q1.
        - Any data points that fall outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers.
        - Data points below the lower bound are replaced with the lower bound.
        - Data points above the upper bound are replaced with the upper bound.
    Inputs:
        data - the dataset
        feature - column name
    Output:
        Updated values for outliers
    """
    Q1 = data[feature].quantile(0.25)
    Q3 = data[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_range = Q1 - 1.5 * IQR
    upper_range = Q3 + 1.5 * IQR
    # replace outliers with the lower range and upper range values:
    data[feature] = np.where(data[feature] < lower_range, lower_range, data[feature])
    data[feature] = np.where(data[feature] > upper_range, upper_range, data[feature])
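To see the capping behaviour on a toy column (made-up numbers), note that `Series.clip` produces the same result as the two `np.where` calls:

```python
import pandas as pd

# Toy column with one extreme value (made-up numbers)
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
q1, q3 = df["x"].quantile(0.25), df["x"].quantile(0.75)
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
# clip() caps values outside [lower, upper], same effect as the np.where pair
df["x"] = df["x"].clip(lower, upper)
print(df["x"].tolist())  # → [1.0, 2.0, 3.0, 4.0, 7.0]
```

Here Q1 = 2.0 and Q3 = 4.0, so the upper bound is 4.0 + 1.5 * 2.0 = 7.0 and the extreme value 100 is capped at 7.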
target_columns = [
    "Customer_Age",
    "Months_on_book",
    "Months_Inactive_12_mon",
    "Contacts_Count_12_mon",
    "Credit_Limit",
    "Avg_Open_To_Buy",
    "Total_Amt_Chng_Q4_Q1",
    "Total_Trans_Amt",
    "Total_Trans_Ct",
    "Total_Ct_Chng_Q4_Q1",
]
# Detect number of outliers for target variables:
for column in target_columns:
    outliers = find_z_score(data, column)
    print("There are ", len(outliers), " outliers in ", column, " variable")
    print("-" * 20)
There are 1 outliers in Customer_Age variable
--------------------
There are 1 outliers in Months_on_book variable
--------------------
There are 125 outliers in Months_Inactive_12_mon variable
--------------------
There are 179 outliers in Contacts_Count_12_mon variable
--------------------
There are 179 outliers in Credit_Limit variable
--------------------
There are 179 outliers in Avg_Open_To_Buy variable
--------------------
There are 342 outliers in Total_Amt_Chng_Q4_Q1 variable
--------------------
There are 733 outliers in Total_Trans_Amt variable
--------------------
There are 735 outliers in Total_Trans_Ct variable
--------------------
There are 737 outliers in Total_Trans_Ct variable
--------------------
# Outlier treatment for target variables:
for column in target_columns:
    IQR_method(data, column)
# Do the plots for target variables to see if the method improves the outliers:
for column in target_columns:
    generate_plot(data, column)
    plt.show()
Outlier Treatment
# copy the data to another variable:
data_copy = data.copy()
# Separating target variable and other variables
x = data.drop(["Attrition_Flag"], axis=1)
y = data["Attrition_Flag"].apply(lambda x: 1 if x == "Attrited Customer" else 0)
# First split the data into two parts: temporary (train + validation) and test
x_temp, x_test, y_temp, y_test = train_test_split(
    x, y, test_size=0.2, random_state=1, stratify=y
)
# Then split the temporary part into training and validation sets
x_train, x_val, y_train, y_val = train_test_split(
    x_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(x_train.shape, x_val.shape, x_test.shape)
(6075, 19) (2026, 19) (2026, 19)
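The printed shapes follow from the two split fractions: 20% is held out for testing, and 25% of the remaining 80% (that is, 20% of the total) goes to validation, leaving 60% for training. Assuming scikit-learn's behaviour of rounding the fractional test size up, the sizes can be reproduced exactly:

```python
import math

n = 10127                         # rows in the full dataset
n_test = math.ceil(n * 0.20)      # sklearn rounds the fractional test size up
n_temp = n - n_test               # temporary (train + validation) part
n_val = math.ceil(n_temp * 0.25)  # 25% of the remaining 80% ≈ 20% of the total
n_train = n_temp - n_val
print(n_train, n_val, n_test)     # → 6075 2026 2026
```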
x_train.isnull().sum()
Customer_Age                  0
Gender                        0
Dependent_count               0
Education_Level             928
Marital_Status              457
Income_Category               0
Card_Category                 0
Months_on_book                0
Total_Relationship_Count      0
Months_Inactive_12_mon        0
Contacts_Count_12_mon         0
Credit_Limit                  0
Total_Revolving_Bal           0
Avg_Open_To_Buy               0
Total_Amt_Chng_Q4_Q1          0
Total_Trans_Amt               0
Total_Trans_Ct                0
Total_Ct_Chng_Q4_Q1           0
Avg_Utilization_Ratio         0
dtype: int64
x_train.head()
| Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 800 | 40.0 | M | 2 | NaN | Single | $120K + | Blue | 21.0 | 6 | 4.0 | 3.0 | 20056.00 | 1602 | 18454.00 | 0.466 | 1687.0 | 46.0 | 0.533 | 0.080 |
| 498 | 44.0 | M | 1 | NaN | Married | Unknown | Blue | 34.0 | 6 | 2.0 | 0.5 | 2885.00 | 1895 | 990.00 | 0.387 | 1366.0 | 31.0 | 0.632 | 0.657 |
| 4356 | 48.0 | M | 4 | High School | Married | $80K - $120K | Blue | 36.0 | 5 | 1.0 | 2.0 | 6798.00 | 2517 | 4281.00 | 0.873 | 4327.0 | 79.0 | 0.881 | 0.370 |
| 407 | 41.0 | M | 2 | Graduate | NaN | $60K - $80K | Silver | 36.0 | 6 | 2.0 | 0.5 | 23836.25 | 0 | 22660.75 | 0.610 | 1209.0 | 39.0 | 0.300 | 0.000 |
| 8728 | 46.0 | M | 4 | High School | Divorced | $40K - $60K | Silver | 36.0 | 2 | 2.0 | 3.0 | 15034.00 | 1356 | 13678.00 | 0.754 | 7737.0 | 84.0 | 0.750 | 0.090 |
SimpleImputer is a scikit-learn class that handles missing data via imputation techniques.
In our case, the missing values are in categorical columns, so we impute with the mode (the most frequent value).
from sklearn.impute import SimpleImputer

# Create imputer
imputer = SimpleImputer(strategy="most_frequent")
imputed_columns = ["Education_Level", "Marital_Status"]
# Fit on the train data and transform it
x_train[imputed_columns] = imputer.fit_transform(x_train[imputed_columns])
# Transform the validation data
x_val[imputed_columns] = imputer.transform(x_val[imputed_columns])
# Transform the test data
x_test[imputed_columns] = imputer.transform(x_test[imputed_columns])
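A minimal toy example (made-up data) of what `most_frequent` imputation does: the missing entry is replaced with the column's mode.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical column with one missing value (made-up data)
toy = pd.DataFrame({"Education_Level": ["Graduate", "Graduate", np.nan, "Doctorate"]})
imp = SimpleImputer(strategy="most_frequent")
filled = imp.fit_transform(toy)
print(filled.ravel().tolist())  # → ['Graduate', 'Graduate', 'Graduate', 'Doctorate']
```

Because the imputer is fit only on the training data, the training mode is what gets written into the validation and test sets, avoiding data leakage.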
# Checking that no column has missing values in train, validation or test sets
print(x_train.isna().sum())
print("-" * 30)
print(x_val.isna().sum())
print("-" * 30)
print(x_test.isna().sum())
All 19 columns report 0 missing values in the train, validation, and test sets.
x_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6075 entries, 800 to 4035
Data columns (total 19 columns), all non-null. Education_Level and Marital_Status have dtype object; Gender, Income_Category, and Card_Category are category.
dtypes: category(3), float64(11), int64(3), object(2)
memory usage: 825.1+ KB
# Convert object to category:
for column in imputed_columns:
    x_train[column] = x_train[column].astype("category")
    x_val[column] = x_val[column].astype("category")
    x_test[column] = x_test[column].astype("category")
x_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6075 entries, 800 to 4035
Data columns (total 19 columns), all non-null. All five categorical columns now have dtype category.
dtypes: category(5), float64(11), int64(3)
memory usage: 742.4 KB
# target columns for creating dummy variables
dummy_columns = [
    "Gender",
    "Education_Level",
    "Marital_Status",
    "Income_Category",
    "Card_Category",
]
# Create dummy variables for x_train, x_val, and x_test
x_train = pd.get_dummies(x_train, columns=dummy_columns, drop_first=True)
x_val = pd.get_dummies(x_val, columns=dummy_columns, drop_first=True)
x_test = pd.get_dummies(x_test, columns=dummy_columns, drop_first=True)
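A toy illustration (made-up data) of what `drop_first=True` does: for a binary column such as Gender, the first category (F, alphabetically) is dropped and only Gender_M is kept, which avoids redundant, perfectly collinear dummy columns.

```python
import pandas as pd

# Toy frame with a single binary categorical column (made-up data)
toy = pd.DataFrame({"Gender": ["M", "F", "F", "M"]})
dummies = pd.get_dummies(toy, columns=["Gender"], drop_first=True)
print(list(dummies.columns))  # → ['Gender_M']
```

The dropped category is still fully recoverable: Gender_M equal to 0 means F.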
x_train.head()
| Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Gender_M | Education_Level_Doctorate | Education_Level_Graduate | Education_Level_High School | Education_Level_Post-Graduate | Education_Level_Uneducated | Marital_Status_Married | Marital_Status_Single | Income_Category_$40K - $60K | Income_Category_$60K - $80K | Income_Category_$80K - $120K | Income_Category_Less than $40K | Income_Category_Unknown | Card_Category_Gold | Card_Category_Platinum | Card_Category_Silver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 800 | 40.0 | 2 | 21.0 | 6 | 4.0 | 3.0 | 20056.00 | 1602 | 18454.00 | 0.466 | 1687.0 | 46.0 | 0.533 | 0.080 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 498 | 44.0 | 1 | 34.0 | 6 | 2.0 | 0.5 | 2885.00 | 1895 | 990.00 | 0.387 | 1366.0 | 31.0 | 0.632 | 0.657 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4356 | 48.0 | 4 | 36.0 | 5 | 1.0 | 2.0 | 6798.00 | 2517 | 4281.00 | 0.873 | 4327.0 | 79.0 | 0.881 | 0.370 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 407 | 41.0 | 2 | 36.0 | 6 | 2.0 | 0.5 | 23836.25 | 0 | 22660.75 | 0.610 | 1209.0 | 39.0 | 0.300 | 0.000 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8728 | 46.0 | 4 | 36.0 | 2 | 2.0 | 3.0 | 15034.00 | 1356 | 13678.00 | 0.754 | 7737.0 | 84.0 | 0.750 | 0.090 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
x_val.head()
| Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Gender_M | Education_Level_Doctorate | Education_Level_Graduate | Education_Level_High School | Education_Level_Post-Graduate | Education_Level_Uneducated | Marital_Status_Married | Marital_Status_Single | Income_Category_$40K - $60K | Income_Category_$60K - $80K | Income_Category_$80K - $120K | Income_Category_Less than $40K | Income_Category_Unknown | Card_Category_Gold | Card_Category_Platinum | Card_Category_Silver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2894 | 37.0 | 0 | 27.0 | 5 | 2.0 | 3.0 | 15326.00 | 0 | 15326.00 | 1.159 | 2990.00 | 55.0 | 0.964 | 0.000 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 9158 | 58.0 | 2 | 46.0 | 1 | 3.0 | 1.0 | 10286.00 | 0 | 10286.00 | 0.908 | 8199.00 | 59.0 | 0.903 | 0.000 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 9618 | 42.0 | 3 | 23.0 | 3 | 4.0 | 3.0 | 23836.25 | 2070 | 22660.75 | 0.880 | 8619.25 | 102.0 | 0.545 | 0.060 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 9910 | 47.0 | 3 | 36.0 | 3 | 2.0 | 3.0 | 9683.00 | 1116 | 8567.00 | 0.721 | 8619.25 | 104.0 | 0.825 | 0.115 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5497 | 60.0 | 1 | 36.0 | 5 | 2.0 | 2.0 | 2688.00 | 1617 | 1071.00 | 0.552 | 4183.00 | 71.0 | 0.614 | 0.602 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
x_test.head()
| Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Gender_M | Education_Level_Doctorate | Education_Level_Graduate | Education_Level_High School | Education_Level_Post-Graduate | Education_Level_Uneducated | Marital_Status_Married | Marital_Status_Single | Income_Category_$40K - $60K | Income_Category_$60K - $80K | Income_Category_$80K - $120K | Income_Category_Less than $40K | Income_Category_Unknown | Card_Category_Gold | Card_Category_Platinum | Card_Category_Silver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9760 | 32.0 | 1 | 26.0 | 2 | 3.0 | 2.0 | 6407.00 | 1130 | 5277.0 | 0.756 | 8619.25 | 93.0 | 0.603 | 0.176 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 7413 | 50.0 | 1 | 36.0 | 4 | 3.0 | 2.0 | 2317.00 | 0 | 2317.0 | 0.734 | 2214.00 | 41.0 | 0.519 | 0.000 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6074 | 54.0 | 2 | 36.0 | 3 | 3.0 | 3.0 | 3892.00 | 0 | 3892.0 | 0.738 | 4318.00 | 74.0 | 0.762 | 0.000 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3520 | 61.0 | 0 | 36.0 | 4 | 3.0 | 4.0 | 23836.25 | 2517 | 21655.0 | 0.424 | 1658.00 | 27.0 | 0.500 | 0.104 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6103 | 41.0 | 3 | 17.5 | 5 | 3.0 | 4.0 | 4312.00 | 2517 | 1795.0 | 0.741 | 2693.00 | 56.0 | 0.436 | 0.584 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Split data into training, validation and testing set
Missing values treatment
Create dummy variables
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns

# Confusion matrix function:
def confusion_matrix(model, predictor, target):
    """
    Description:
        Create the confusion matrix and plot it as a heatmap
    Inputs:
        model - classifier
        predictor - independent variables
        target - dependent variable
    Outputs:
        Heatmap of the confusion matrix, annotated with counts and percentages
    """
    prediction = model.predict(predictor)
    cm = metrics.confusion_matrix(target, prediction)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
# Create a function to compute the model metrics:
def model_metrics(model, predictor, target):
    """
    Description:
        This is the function to compute the model metrics
    Inputs:
        model - classifier
        predictor - independent variables
        target - dependent variable
    Outputs:
        One-row DataFrame of model metrics
    """
    # Do the prediction:
    prediction = model.predict(predictor)
    # Calculate the accuracy:
    accuracy = metrics.accuracy_score(target, prediction)
    # Calculate recall:
    recall = metrics.recall_score(target, prediction)
    # Calculate precision:
    precision = metrics.precision_score(target, prediction)
    # Calculate F1 score:
    f1 = metrics.f1_score(target, prediction)
    # creating a dataframe of metrics
    metrics_dataframe = pd.DataFrame(
        {
            "Accuracy": accuracy,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )
    return metrics_dataframe
# Build model
logistic_regression = LogisticRegression(random_state=1)
logistic_regression.fit(x_train, y_train)
LogisticRegression(random_state=1)
# Calculate the model metrics for training dataset
logistic_regression_metrics_train = model_metrics(logistic_regression, x_train, y_train)
logistic_regression_metrics_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.881646 | 0.451844 | 0.7056 | 0.550906 |
# Confusion matrix for training dataset
confusion_matrix(logistic_regression, x_train, y_train)
# Calculate the model metrics for validation dataset
logistic_regression_metrics_val = model_metrics(logistic_regression, x_val, y_val)
logistic_regression_metrics_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.886476 | 0.506135 | 0.705128 | 0.589286 |
# Confusion matrix for validation dataset
confusion_matrix(logistic_regression, x_val, y_val)
# Build the model
decision_tree = DecisionTreeClassifier(random_state=1)
decision_tree.fit(x_train, y_train)
DecisionTreeClassifier(random_state=1)
# Calculate the model metrics for training dataset
decision_tree_metrics_train = model_metrics(decision_tree, x_train, y_train)
decision_tree_metrics_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
# Confusion matrix for training dataset
confusion_matrix(decision_tree, x_train, y_train)
# Calculate the model metrics for validation dataset
decision_tree_metrics_val = model_metrics(decision_tree, x_val, y_val)
decision_tree_metrics_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.937315 | 0.809816 | 0.802432 | 0.806107 |
# Confusion matrix for validation dataset
confusion_matrix(decision_tree, x_val, y_val)
# Build model
bagging = BaggingClassifier(random_state=1)
bagging.fit(x_train, y_train)
BaggingClassifier(random_state=1)
# Calculate the model metrics for training dataset
bagging_metrics_train = model_metrics(bagging, x_train, y_train)
bagging_metrics_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.997695 | 0.987705 | 0.99793 | 0.992791 |
# Confusion matrix for training dataset
confusion_matrix(bagging, x_train, y_train)
# Calculate the model metrics for validation dataset
bagging_metrics_val = model_metrics(bagging, x_val, y_val)
bagging_metrics_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.950148 | 0.797546 | 0.881356 | 0.837359 |
# Confusion matrix for validation dataset
confusion_matrix(bagging, x_val, y_val)
# Build the model
ada_boost = AdaBoostClassifier(random_state=1)
ada_boost.fit(x_train, y_train)
AdaBoostClassifier(random_state=1)
# Calculate the model metrics for training dataset
ada_boost_metrics_train = model_metrics(ada_boost, x_train, y_train)
ada_boost_metrics_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.957366 | 0.83709 | 0.890949 | 0.86318 |
# Confusion matrix for training dataset
confusion_matrix(ada_boost, x_train, y_train)
# Calculate the model metrics for validation dataset
ada_boost_metrics_val = model_metrics(ada_boost, x_val, y_val)
ada_boost_metrics_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.959526 | 0.858896 | 0.886076 | 0.872274 |
# Confusion matrix for validation dataset
confusion_matrix(ada_boost, x_val, y_val)
# Build the model
gradient_boost = GradientBoostingClassifier(random_state=1)
gradient_boost.fit(x_train, y_train)
GradientBoostingClassifier(random_state=1)
# Calculate the model metrics for training dataset
gradient_boost_metrics_train = model_metrics(gradient_boost, x_train, y_train)
gradient_boost_metrics_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.973827 | 0.880123 | 0.953385 | 0.91529 |
# Confusion matrix for training dataset
confusion_matrix(gradient_boost, x_train, y_train)
# Calculate the model metrics for validation dataset
gradient_boost_metrics_val = model_metrics(gradient_boost, x_val, y_val)
gradient_boost_metrics_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.969398 | 0.871166 | 0.934211 | 0.901587 |
# Confusion matrix for validation dataset
confusion_matrix(gradient_boost, x_val, y_val)
# Build the model
xg_boost = XGBClassifier(random_state=1, eval_metric="logloss")
xg_boost.fit(x_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
gamma=0, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=4,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
# Calculate the model metrics for training dataset
xg_boost_metrics_train = model_metrics(xg_boost, x_train, y_train)
xg_boost_metrics_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
# Confusion matrix for training dataset
confusion_matrix(xg_boost, x_train, y_train)
# Calculate the model metrics for validation dataset
xg_boost_metrics_val = model_metrics(xg_boost, x_val, y_val)
xg_boost_metrics_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.968411 | 0.883436 | 0.917197 | 0.9 |
# Confusion matrix for validation dataset
confusion_matrix(xg_boost, x_val, y_val)
training_default_hyperparameters = pd.concat(
    [
        logistic_regression_metrics_train.T,
        decision_tree_metrics_train.T,
        bagging_metrics_train.T,
        ada_boost_metrics_train.T,
        gradient_boost_metrics_train.T,
        xg_boost_metrics_train.T,
    ],
    axis=1,
)
training_default_hyperparameters.columns = [
    "logistic_regression_metrics_train",
    "decision_tree_metrics_train",
    "bagging_metrics_train",
    "ada_boost_metrics_train",
    "gradient_boost_metrics_train",
    "xg_boost_metrics_train",
]
training_default_hyperparameters
| logistic_regression_metrics_train | decision_tree_metrics_train | bagging_metrics_train | ada_boost_metrics_train | gradient_boost_metrics_train | xg_boost_metrics_train | |
|---|---|---|---|---|---|---|
| Accuracy | 0.881646 | 1.0 | 0.997695 | 0.957366 | 0.973827 | 1.0 |
| Recall | 0.451844 | 1.0 | 0.987705 | 0.837090 | 0.880123 | 1.0 |
| Precision | 0.705600 | 1.0 | 0.997930 | 0.890949 | 0.953385 | 1.0 |
| F1 | 0.550906 | 1.0 | 0.992791 | 0.863180 | 0.915290 | 1.0 |
validation_default_hyperparameters = pd.concat(
    [
        logistic_regression_metrics_val.T,
        decision_tree_metrics_val.T,
        bagging_metrics_val.T,
        ada_boost_metrics_val.T,
        gradient_boost_metrics_val.T,
        xg_boost_metrics_val.T,
    ],
    axis=1,
)
validation_default_hyperparameters.columns = [
    "logistic_regression_metrics_val",
    "decision_tree_metrics_val",
    "bagging_metrics_val",
    "ada_boost_metrics_val",
    "gradient_boost_metrics_val",
    "xg_boost_metrics_val",
]
validation_default_hyperparameters
| logistic_regression_metrics_val | decision_tree_metrics_val | bagging_metrics_val | ada_boost_metrics_val | gradient_boost_metrics_val | xg_boost_metrics_val | |
|---|---|---|---|---|---|---|
| Accuracy | 0.886476 | 0.937315 | 0.950148 | 0.959526 | 0.969398 | 0.968411 |
| Recall | 0.506135 | 0.809816 | 0.797546 | 0.858896 | 0.871166 | 0.883436 |
| Precision | 0.705128 | 0.802432 | 0.881356 | 0.886076 | 0.934211 | 0.917197 |
| F1 | 0.589286 | 0.806107 | 0.837359 | 0.872274 | 0.901587 | 0.900000 |
Logistic Regression
Decision Tree
Bagging classifier
AdaBoost classifier
Gradient Boost classifier
XGBoost classifier
------------------ Overall ------------------
# Build models
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Logistic Regression", LogisticRegression(random_state=1)))
models.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("Gradient Boost", GradientBoostingClassifier(random_state=1)))
models.append(("XGBoost", XGBClassifier(random_state=1, eval_metric="logloss")))
results = [] # Empty list to store all model's CV scores
names = [] # Empty list to store name of the models
# Loop through all models to get the mean cross-validated score
print("\n" "Cross-Validation Performance:" "\n")
for name, model in models:
    scoring = "recall"
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # setting the number of splits to 5
    cv_result = cross_val_score(
        estimator=model, X=x_train, y=y_train, scoring=scoring, cv=kfold
    )
    results.append(cv_result)
    names.append(name)
    print("{}: {}".format(name, cv_result.mean() * 100))

print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(x_train, y_train)
    scores = metrics.recall_score(y_train, model.predict(x_train)) * 100
    print("{}: {}".format(name, scores))
Cross-Validation Performance:

Logistic Regression: 43.63788592360021
Decision Tree: 78.38356881214024
Bagging: 78.4814233385662
AdaBoost: 81.34746206174779
Gradient Boost: 81.65567765567765
XGBoost: 86.36996336996339

Training Performance:

Logistic Regression: 45.1844262295082
Decision Tree: 100.0
Bagging: 98.77049180327869
AdaBoost: 83.70901639344262
Gradient Boost: 88.01229508196722
XGBoost: 100.0
# Plotting boxplots for CV scores of all models defined above
fig = plt.figure(figsize=(10, 7))
fig.suptitle("Algorithm Comparison")
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
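Besides the boxplots, the per-model scores can be collected into a small comparison table. A minimal sketch, using hypothetical score arrays in place of the `results` list built above:

```python
import numpy as np
import pandas as pd

# Hypothetical CV recall arrays standing in for the `results`/`names` lists above
results = [np.array([0.43, 0.45, 0.44]), np.array([0.85, 0.87, 0.86])]
names = ["Logistic Regression", "XGBoost"]

# One row per model, with the mean and spread of its cross-validated recall
cv_summary = pd.DataFrame(
    {name: [cv.mean(), cv.std()] for name, cv in zip(names, results)},
    index=["mean_recall", "std_recall"],
).T
print(cv_summary)
```

Sorting this table by `mean_recall` gives the same ranking the boxplots show visually.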
# Count the class before oversampling
print("Before Oversampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label '0': {} \n".format(sum(y_train == 0)))
Before Oversampling, counts of label '1': 976
Before Oversampling, counts of label '0': 5099
# Fit SMOTE on train data
smote = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
x_train_oversamp, y_train_oversamp = smote.fit_resample(x_train, y_train)
print("After Oversampling, counts of label '1': {}".format(sum(y_train_oversamp == 1)))
print(
"After Oversampling, counts of label '0': {} \n".format(sum(y_train_oversamp == 0))
)
print("After Oversampling, the shape of x_train: {}".format(x_train_oversamp.shape))
print("After Oversampling, the shape of y_train: {} \n".format(y_train_oversamp.shape))
After Oversampling, counts of label '1': 5099
After Oversampling, counts of label '0': 5099
After Oversampling, the shape of x_train: (10198, 30)
After Oversampling, the shape of y_train: (10198,)
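SMOTE synthesizes new minority-class rows by interpolating between a minority point and its nearest minority neighbors; with `sampling_strategy=1` the minority class is brought up to the majority count. The count effect (not the interpolation itself) can be illustrated with a simplified random-duplication stand-in in plain NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.array([0] * 50 + [1] * 10)            # imbalanced toy labels (5:1)
minority_idx = np.flatnonzero(y == 1)
n_extra = (y == 0).sum() - (y == 1).sum()    # extra minority rows needed for a 1:1 ratio
extra = rng.choice(minority_idx, size=n_extra, replace=True)
y_bal = np.concatenate([y, y[extra]])
print((y_bal == 0).sum(), (y_bal == 1).sum())  # 50 50
```

SMOTE differs from this sketch in that its added rows are new interpolated points rather than exact duplicates, which is why it tends to generalize better than naive oversampling.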
# Build model
logistic_regression_oversamp = LogisticRegression(random_state=1)
logistic_regression_oversamp.fit(x_train_oversamp, y_train_oversamp)
LogisticRegression(random_state=1)
# Calculate the model metrics for training dataset
logistic_oversamp_metrics_train = model_metrics(
logistic_regression_oversamp, x_train_oversamp, y_train_oversamp
)
logistic_oversamp_metrics_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.821043 | 0.827025 | 0.817248 | 0.822107 |
# Confusion matrix for training dataset
confusion_matrix(logistic_regression_oversamp, x_train_oversamp, y_train_oversamp)
# Calculate the model metrics for validation dataset
logistic_oversamp_metrics_val = model_metrics(
logistic_regression_oversamp, x_val, y_val
)
logistic_oversamp_metrics_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.815893 | 0.809816 | 0.45913 | 0.586016 |
# Confusion matrix for validation dataset
confusion_matrix(logistic_regression_oversamp, x_val, y_val)
# Build the model
decision_tree_oversamp = DecisionTreeClassifier(random_state=1)
decision_tree_oversamp.fit(x_train_oversamp, y_train_oversamp)
DecisionTreeClassifier(random_state=1)
# Calculate the model metrics for training dataset
tree_oversamp_metrics_train = model_metrics(
decision_tree_oversamp, x_train_oversamp, y_train_oversamp
)
tree_oversamp_metrics_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
# Confusion matrix for training dataset
confusion_matrix(decision_tree_oversamp, x_train_oversamp, y_train_oversamp)
# Calculate the model metrics for validation dataset
tree_oversamp_metrics_val = model_metrics(decision_tree_oversamp, x_val, y_val)
tree_oversamp_metrics_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.927443 | 0.834356 | 0.745205 | 0.787265 |
# Confusion matrix for validation dataset
confusion_matrix(decision_tree_oversamp, x_val, y_val)
# Build model
bagging_oversamp = BaggingClassifier(random_state=1)
bagging_oversamp.fit(x_train_oversamp, y_train_oversamp)
BaggingClassifier(random_state=1)
# Calculate the model metrics for training dataset
bagging_oversamp_metrics_train = model_metrics(
bagging_oversamp, x_train_oversamp, y_train_oversamp
)
bagging_oversamp_metrics_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.997549 | 0.99647 | 0.998624 | 0.997546 |
# Confusion matrix for training dataset
confusion_matrix(bagging_oversamp, x_train_oversamp, y_train_oversamp)
# Calculate the model metrics for validation dataset
bagging_oversamp_metrics_val = model_metrics(bagging_oversamp, x_val, y_val)
bagging_oversamp_metrics_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.940276 | 0.837423 | 0.800587 | 0.818591 |
# Confusion matrix for validation dataset
confusion_matrix(bagging_oversamp, x_val, y_val)
# Build the model
ada_boost_oversamp = AdaBoostClassifier(random_state=1)
ada_boost_oversamp.fit(x_train_oversamp, y_train_oversamp)
AdaBoostClassifier(random_state=1)
# Calculate the model metrics for training dataset
ada_boost_oversamp_metrics_train = model_metrics(
ada_boost_oversamp, x_train_oversamp, y_train_oversamp
)
ada_boost_oversamp_metrics_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.965581 | 0.966072 | 0.965125 | 0.965598 |
# Confusion matrix for training dataset
confusion_matrix(ada_boost_oversamp, x_train_oversamp, y_train_oversamp)
# Calculate the model metrics for validation dataset
ada_boost_oversamp_metrics_val = model_metrics(ada_boost_oversamp, x_val, y_val)
ada_boost_oversamp_metrics_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.946199 | 0.865031 | 0.81268 | 0.838039 |
# Confusion matrix for validation dataset
confusion_matrix(ada_boost_oversamp, x_val, y_val)
# Build the model
gradient_boost_oversamp = GradientBoostingClassifier(random_state=1)
gradient_boost_oversamp.fit(x_train_oversamp, y_train_oversamp)
GradientBoostingClassifier(random_state=1)
# Calculate the model metrics for training dataset
gradient_boost_oversamp_metrics_train = model_metrics(
gradient_boost_oversamp, x_train_oversamp, y_train_oversamp
)
gradient_boost_oversamp_metrics_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.979702 | 0.981957 | 0.977548 | 0.979748 |
# Confusion matrix for training dataset
confusion_matrix(gradient_boost_oversamp, x_train_oversamp, y_train_oversamp)
# Calculate the model metrics for validation dataset
gradient_boost_oversamp_metrics_val = model_metrics(
gradient_boost_oversamp, x_val, y_val
)
gradient_boost_oversamp_metrics_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.959033 | 0.889571 | 0.860534 | 0.874811 |
# Confusion matrix for validation dataset
confusion_matrix(gradient_boost_oversamp, x_val, y_val)
# Build the model
xg_boost_oversamp = XGBClassifier(random_state=1, eval_metric="logloss")
xg_boost_oversamp.fit(x_train_oversamp, y_train_oversamp)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, eval_metric='logloss',
gamma=0, gpu_id=-1, importance_type='gain',
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=100, n_jobs=4,
num_parallel_tree=1, random_state=1, reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
# Calculate the model metrics for training dataset
xg_boost_oversamp_metrics_train = model_metrics(
xg_boost_oversamp, x_train_oversamp, y_train_oversamp
)
xg_boost_oversamp_metrics_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
# Confusion matrix for training dataset
confusion_matrix(xg_boost_oversamp, x_train_oversamp, y_train_oversamp)
# Calculate the model metrics for validation dataset
xg_boost_oversamp_metrics_val = model_metrics(xg_boost_oversamp, x_val, y_val)
xg_boost_oversamp_metrics_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.969891 | 0.91411 | 0.900302 | 0.907154 |
# Confusion matrix for validation dataset
confusion_matrix(xg_boost_oversamp, x_val, y_val)
training_oversampling = pd.concat(
[
logistic_oversamp_metrics_train.T,
tree_oversamp_metrics_train.T,
bagging_oversamp_metrics_train.T,
ada_boost_oversamp_metrics_train.T,
gradient_boost_oversamp_metrics_train.T,
xg_boost_oversamp_metrics_train.T,
],
axis=1,
)
training_oversampling.columns = [
"logistic_oversamp_metrics_train",
"tree_oversamp_metrics_train",
"bagging_oversamp_metrics_train",
"ada_boost_oversamp_metrics_train",
"gradient_boost_oversamp_metrics_train",
"xg_boost_oversamp_metrics_train",
]
training_oversampling
| logistic_oversamp_metrics_train | tree_oversamp_metrics_train | bagging_oversamp_metrics_train | ada_boost_oversamp_metrics_train | gradient_boost_oversamp_metrics_train | xg_boost_oversamp_metrics_train | |
|---|---|---|---|---|---|---|
| Accuracy | 0.821043 | 1.0 | 0.997549 | 0.965581 | 0.979702 | 1.0 |
| Recall | 0.827025 | 1.0 | 0.996470 | 0.966072 | 0.981957 | 1.0 |
| Precision | 0.817248 | 1.0 | 0.998624 | 0.965125 | 0.977548 | 1.0 |
| F1 | 0.822107 | 1.0 | 0.997546 | 0.965598 | 0.979748 | 1.0 |
validation_oversampling = pd.concat(
[
logistic_oversamp_metrics_val.T,
tree_oversamp_metrics_val.T,
bagging_oversamp_metrics_val.T,
ada_boost_oversamp_metrics_val.T,
gradient_boost_oversamp_metrics_val.T,
xg_boost_oversamp_metrics_val.T,
],
axis=1,
)
validation_oversampling.columns = [
"logistic_oversamp_metrics_val",
"tree_oversamp_metrics_val",
"bagging_oversamp_metrics_val",
"ada_boost_oversamp_metrics_val",
"gradient_boost_oversamp_metrics_val",
"xg_boost_oversamp_metrics_val",
]
validation_oversampling
| logistic_oversamp_metrics_val | tree_oversamp_metrics_val | bagging_oversamp_metrics_val | ada_boost_oversamp_metrics_val | gradient_boost_oversamp_metrics_val | xg_boost_oversamp_metrics_val | |
|---|---|---|---|---|---|---|
| Accuracy | 0.815893 | 0.927443 | 0.940276 | 0.946199 | 0.959033 | 0.969891 |
| Recall | 0.809816 | 0.834356 | 0.837423 | 0.865031 | 0.889571 | 0.914110 |
| Precision | 0.459130 | 0.745205 | 0.800587 | 0.812680 | 0.860534 | 0.900302 |
| F1 | 0.586016 | 0.787265 | 0.818591 | 0.838039 | 0.874811 | 0.907154 |
Logistic Regression with oversampled data
Decision Tree with oversampled data
Bagging with oversampled data
AdaBoost with oversampled data
Gradient Boost with oversampled data
XGBoost with oversampled data
--------- Overall ---------
# Count the class before oversampling
print("Before Oversampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label '0': {} \n".format(sum(y_train == 0)))
Before Oversampling, counts of label '1': 976
Before Oversampling, counts of label '0': 5099
# Fit RandomUnderSampler on the train data
undersampler = RandomUnderSampler(random_state=1, sampling_strategy=1)
x_train_undersamp, y_train_undersamp = undersampler.fit_resample(x_train, y_train)
print(
"After Under Sampling, count of label '1': {}".format(sum(y_train_undersamp == 1))
)
print(
"After Under Sampling, count of label '0': {} \n".format(
sum(y_train_undersamp == 0)
)
)
print("After Under Sampling, the shape of x_train: {}".format(x_train_undersamp.shape))
print(
"After Under Sampling, the shape of y_train: {} \n".format(y_train_undersamp.shape)
)
After Under Sampling, count of label '1': 976
After Under Sampling, count of label '0': 976
After Under Sampling, the shape of x_train: (1952, 30)
After Under Sampling, the shape of y_train: (1952,)
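RandomUnderSampler does the opposite of SMOTE: it discards majority-class rows until the classes match, which is why the training set shrinks to 1,952 rows. A minimal sketch of the same idea in plain NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.array([0] * 50 + [1] * 10)            # imbalanced toy labels (5:1)
majority_idx = np.flatnonzero(y == 0)
# Keep only as many majority rows as there are minority rows
keep = rng.choice(majority_idx, size=(y == 1).sum(), replace=False)
idx = np.sort(np.concatenate([keep, np.flatnonzero(y == 1)]))
y_under = y[idx]
print((y_under == 0).sum(), (y_under == 1).sum())  # 10 10
```

The trade-off is visible in the results that follow: undersampled models see far less data, which tends to raise recall but lower precision relative to the oversampled models.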
# Build model
logistic_regression_undersamp = LogisticRegression(random_state=1)
logistic_regression_undersamp.fit(x_train_undersamp, y_train_undersamp)
LogisticRegression(random_state=1)
# Calculate the model metrics for training dataset
logistic_undersamp_metrics_train = model_metrics(
logistic_regression_undersamp, x_train_undersamp, y_train_undersamp
)
logistic_undersamp_metrics_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.807377 | 0.802254 | 0.810559 | 0.806385 |
# Confusion matrix for training dataset
confusion_matrix(logistic_regression_undersamp, x_train_undersamp, y_train_undersamp)
# Calculate the model metrics for validation dataset
logistic_undersamp_metrics_val = model_metrics(
logistic_regression_undersamp, x_val, y_val
)
logistic_undersamp_metrics_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.804541 | 0.806748 | 0.441275 | 0.570499 |
# Confusion matrix for validation dataset
confusion_matrix(logistic_regression_undersamp, x_val, y_val)
# Build the model
decision_tree_undersamp = DecisionTreeClassifier(random_state=1)
decision_tree_undersamp.fit(x_train_undersamp, y_train_undersamp)
DecisionTreeClassifier(random_state=1)
# Calculate the model metrics for training dataset
tree_undersamp_metrics_train = model_metrics(
decision_tree_undersamp, x_train_undersamp, y_train_undersamp
)
tree_undersamp_metrics_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
# Confusion matrix for training dataset
confusion_matrix(decision_tree_undersamp, x_train_undersamp, y_train_undersamp)
# Calculate the model metrics for validation dataset
tree_undersamp_metrics_val = model_metrics(decision_tree_undersamp, x_val, y_val)
tree_undersamp_metrics_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.888944 | 0.920245 | 0.601202 | 0.727273 |
# Confusion matrix for validation dataset
confusion_matrix(decision_tree_undersamp, x_val, y_val)
# Build model
bagging_undersamp = BaggingClassifier(random_state=1)
bagging_undersamp.fit(x_train_undersamp, y_train_undersamp)
BaggingClassifier(random_state=1)
# Calculate the model metrics for training dataset
bagging_undersamp_metrics_train = model_metrics(
bagging_undersamp, x_train_undersamp, y_train_undersamp
)
bagging_undersamp_metrics_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.994877 | 0.991803 | 0.997938 | 0.994861 |
# Confusion matrix for training dataset
confusion_matrix(bagging_undersamp, x_train_undersamp, y_train_undersamp)
# Calculate the model metrics for validation dataset
bagging_undersamp_metrics_val = model_metrics(bagging_undersamp, x_val, y_val)
bagging_undersamp_metrics_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.920039 | 0.920245 | 0.688073 | 0.787402 |
# Confusion matrix for validation dataset
confusion_matrix(bagging_undersamp, x_val, y_val)
# Build the model
ada_boost_undersamp = AdaBoostClassifier(random_state=1)
ada_boost_undersamp.fit(x_train_undersamp, y_train_undersamp)
AdaBoostClassifier(random_state=1)
# Calculate the model metrics for training dataset
ada_boost_undersamp_metrics_train = model_metrics(
ada_boost_undersamp, x_train_undersamp, y_train_undersamp
)
ada_boost_undersamp_metrics_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.942623 | 0.947746 | 0.938134 | 0.942915 |
# Confusion matrix for training dataset
confusion_matrix(ada_boost_undersamp, x_train_undersamp, y_train_undersamp)
# Calculate the model metrics for validation dataset
ada_boost_undersamp_metrics_val = model_metrics(ada_boost_undersamp, x_val, y_val)
ada_boost_undersamp_metrics_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.928924 | 0.95092 | 0.707763 | 0.811518 |
# Confusion matrix for validation dataset
confusion_matrix(ada_boost_undersamp, x_val, y_val)
# Build the model
gradient_boost_undersamp = GradientBoostingClassifier(random_state=1)
gradient_boost_undersamp.fit(x_train_undersamp, y_train_undersamp)
GradientBoostingClassifier(random_state=1)
# Calculate the model metrics for training dataset
gradient_boost_undersamp_metrics_train = model_metrics(
gradient_boost_undersamp, x_train_undersamp, y_train_undersamp
)
gradient_boost_undersamp_metrics_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.974898 | 0.980533 | 0.969605 | 0.975038 |
# Confusion matrix for training dataset
confusion_matrix(gradient_boost_undersamp, x_train_undersamp, y_train_undersamp)
# Calculate the model metrics for validation dataset
gradient_boost_undersamp_metrics_val = model_metrics(
gradient_boost_undersamp, x_val, y_val
)
gradient_boost_undersamp_metrics_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.936328 | 0.957055 | 0.730679 | 0.828685 |
# Confusion matrix for validation dataset
confusion_matrix(gradient_boost_undersamp, x_val, y_val)
# Build the model
xg_boost_undersamp = XGBClassifier(random_state=1, eval_metric="logloss")
xg_boost_undersamp.fit(x_train_undersamp, y_train_undersamp)
XGBClassifier(random_state=1, eval_metric='logloss')
# Calculate the model metrics for training dataset
xg_boost_undersamp_metrics_train = model_metrics(
xg_boost_undersamp, x_train_undersamp, y_train_undersamp
)
xg_boost_undersamp_metrics_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
# Confusion matrix for training dataset
confusion_matrix(xg_boost_undersamp, x_train_undersamp, y_train_undersamp)
# Calculate the model metrics for validation dataset
xg_boost_undersamp_metrics_val = model_metrics(xg_boost_undersamp, x_val, y_val)
xg_boost_undersamp_metrics_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.939289 | 0.953988 | 0.742243 | 0.834899 |
# Confusion matrix for validation dataset
confusion_matrix(xg_boost_undersamp, x_val, y_val)
training_undersampling = pd.concat(
[
logistic_undersamp_metrics_train.T,
tree_undersamp_metrics_train.T,
bagging_undersamp_metrics_train.T,
ada_boost_undersamp_metrics_train.T,
gradient_boost_undersamp_metrics_train.T,
xg_boost_undersamp_metrics_train.T,
],
axis=1,
)
training_undersampling.columns = [
"logistic_undersamp_metrics_train",
"tree_undersamp_metrics_train",
"bagging_undersamp_metrics_train",
"ada_boost_undersamp_metrics_train",
"gradient_boost_undersamp_metrics_train",
"xg_boost_undersamp_metrics_train",
]
training_undersampling
| logistic_undersamp_metrics_train | tree_undersamp_metrics_train | bagging_undersamp_metrics_train | ada_boost_undersamp_metrics_train | gradient_boost_undersamp_metrics_train | xg_boost_undersamp_metrics_train | |
|---|---|---|---|---|---|---|
| Accuracy | 0.807377 | 1.0 | 0.994877 | 0.942623 | 0.974898 | 1.0 |
| Recall | 0.802254 | 1.0 | 0.991803 | 0.947746 | 0.980533 | 1.0 |
| Precision | 0.810559 | 1.0 | 0.997938 | 0.938134 | 0.969605 | 1.0 |
| F1 | 0.806385 | 1.0 | 0.994861 | 0.942915 | 0.975038 | 1.0 |
validation_undersampling = pd.concat(
[
logistic_undersamp_metrics_val.T,
tree_undersamp_metrics_val.T,
bagging_undersamp_metrics_val.T,
ada_boost_undersamp_metrics_val.T,
gradient_boost_undersamp_metrics_val.T,
xg_boost_undersamp_metrics_val.T,
],
axis=1,
)
validation_undersampling.columns = [
"logistic_undersamp_metrics_val",
"tree_undersamp_metrics_val",
"bagging_undersamp_metrics_val",
"ada_boost_undersamp_metrics_val",
"gradient_boost_undersamp_metrics_val",
"xg_boost_undersamp_metrics_val",
]
validation_undersampling
| logistic_undersamp_metrics_val | tree_undersamp_metrics_val | bagging_undersamp_metrics_val | ada_boost_undersamp_metrics_val | gradient_boost_undersamp_metrics_val | xg_boost_undersamp_metrics_val | |
|---|---|---|---|---|---|---|
| Accuracy | 0.804541 | 0.888944 | 0.920039 | 0.928924 | 0.936328 | 0.939289 |
| Recall | 0.806748 | 0.920245 | 0.920245 | 0.950920 | 0.957055 | 0.953988 |
| Precision | 0.441275 | 0.601202 | 0.688073 | 0.707763 | 0.730679 | 0.742243 |
| F1 | 0.570499 | 0.727273 | 0.787402 | 0.811518 | 0.828685 | 0.834899 |
Logistic Regression with undersampled data
Decision Tree with undersampled data
Bagging with undersampled data
AdaBoost with undersampled data
Gradient Boost with undersampled data
XGBoost with undersampled data
---------- Overall ----------
%%time
# Build the model
gradient_boost_tuned = GradientBoostingClassifier(random_state=1)
gradient_boost_tuned.fit(x_train, y_train)
Wall time: 1.68 s
GradientBoostingClassifier(random_state=1)
# Check available parameters for Gradient Boost
gradient_boost_tuned.get_params().keys()
dict_keys(['ccp_alpha', 'criterion', 'init', 'learning_rate', 'loss', 'max_depth', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'n_estimators', 'n_iter_no_change', 'random_state', 'subsample', 'tol', 'validation_fraction', 'verbose', 'warm_start'])
# Parameters for tuning:
parameters = {
"n_estimators": np.arange(50, 150, 50),
"learning_rate": [0.01, 0.1, 0.2, 0.05],
"subsample": [0.8, 0.9, 1],
"max_depth": np.arange(1, 5, 1),
}
# Type of scoring used to compare parameter combinations - We will choose Recall
recall_score = metrics.make_scorer(metrics.recall_score)
# Run the random search
# using n_iter = 30, so randomized search will try 30 different combinations of hyperparameters
# by default, n_iter = 10
gradient_boost_tuned_random_search = RandomizedSearchCV(
gradient_boost_tuned,
parameters,
n_iter=30,
scoring=recall_score,
cv=5,
random_state=1,
n_jobs=-1,
verbose=2,
)
gradient_boost_tuned_random_search = gradient_boost_tuned_random_search.fit(
x_train, y_train
)
Fitting 5 folds for each of 30 candidates, totalling 150 fits
# Print the best combination of parameters
gradient_boost_tuned_random_search.best_params_
{'subsample': 1, 'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.2}
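For context, the grid above spans 2 × 4 × 3 × 4 = 96 combinations, of which the randomized search samples only 30; with `cv=5` that gives the 150 fits reported. A quick check of that arithmetic using the same `parameters` dictionary:

```python
import numpy as np

parameters = {
    "n_estimators": np.arange(50, 150, 50),   # [50, 100] -> 2 values
    "learning_rate": [0.01, 0.1, 0.2, 0.05],  # 4 values
    "subsample": [0.8, 0.9, 1],               # 3 values
    "max_depth": np.arange(1, 5, 1),          # [1, 2, 3, 4] -> 4 values
}
n_combinations = int(np.prod([len(v) for v in parameters.values()]))
print(n_combinations)  # 96 candidate combinations in the full grid
print(30 * 5)          # 150 fits: n_iter=30 candidates x cv=5 folds
```

An exhaustive `GridSearchCV` would need 96 × 5 = 480 fits, so the randomized search covers roughly a third of the grid at a third of the cost.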
# Build the model with the best combination of parameters
gradient_boost_tuned = GradientBoostingClassifier(
random_state=1, subsample=1, n_estimators=100, max_depth=3, learning_rate=0.2
)
gradient_boost_tuned.fit(x_train, y_train)
GradientBoostingClassifier(learning_rate=0.2, random_state=1, subsample=1)
# Calculate the model metrics for training dataset
gradient_boost_tuned_metrics_train = model_metrics(
gradient_boost_tuned, x_train, y_train
)
gradient_boost_tuned_metrics_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.985844 | 0.944672 | 0.966457 | 0.95544 |
# Confusion matrix for training dataset
confusion_matrix(gradient_boost_tuned, x_train, y_train)
# Calculate the model metrics for validation dataset
gradient_boost_tuned_metrics_val = model_metrics(gradient_boost_tuned, x_val, y_val)
gradient_boost_tuned_metrics_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.972853 | 0.883436 | 0.944262 | 0.912837 |
# Confusion matrix for validation dataset
confusion_matrix(gradient_boost_tuned, x_val, y_val)
%%time
# Build the model
gradient_boost_oversamp_tuned = GradientBoostingClassifier(random_state=1)
gradient_boost_oversamp_tuned.fit(x_train_oversamp, y_train_oversamp)
Wall time: 2.79 s
GradientBoostingClassifier(random_state=1)
# Check available parameters for Gradient Boost
gradient_boost_oversamp_tuned.get_params().keys()
dict_keys(['ccp_alpha', 'criterion', 'init', 'learning_rate', 'loss', 'max_depth', 'max_features', 'max_leaf_nodes', 'min_impurity_decrease', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'n_estimators', 'n_iter_no_change', 'random_state', 'subsample', 'tol', 'validation_fraction', 'verbose', 'warm_start'])
# Parameters for tuning:
parameters = {
"n_estimators": np.arange(50, 150, 50),
"learning_rate": [0.01, 0.1, 0.2, 0.05],
"subsample": [0.8, 0.9, 1],
"max_depth": np.arange(1, 5, 1),
}
# Type of scoring used to compare parameter combinations - We will choose Recall
recall_score = metrics.make_scorer(metrics.recall_score)
# Run the random search
# using n_iter = 30, so randomized search will try 30 different combinations of hyperparameters
# by default, n_iter = 10
gradient_boost_oversamp_tuned_random_search = RandomizedSearchCV(
gradient_boost_oversamp_tuned,
parameters,
n_iter=30,
scoring=recall_score,
cv=5,
random_state=1,
n_jobs=-1,
verbose=2,
)
gradient_boost_oversamp_tuned_random_search = (
gradient_boost_oversamp_tuned_random_search.fit(x_train_oversamp, y_train_oversamp)
)
Fitting 5 folds for each of 30 candidates, totalling 150 fits
# Print the best combination of parameters
gradient_boost_oversamp_tuned_random_search.best_params_
{'subsample': 0.8, 'n_estimators': 50, 'max_depth': 4, 'learning_rate': 0.05}
# Build the model with the best combination of parameters
gradient_boost_oversamp_tuned = GradientBoostingClassifier(
random_state=1, subsample=0.8, n_estimators=50, max_depth=4, learning_rate=0.05
)
gradient_boost_oversamp_tuned.fit(x_train_oversamp, y_train_oversamp)
GradientBoostingClassifier(learning_rate=0.05, max_depth=4, n_estimators=50,
random_state=1, subsample=0.8)
# Calculate the model metrics for training dataset
gradient_boost_oversamp_tuned_metrics_train = model_metrics(
gradient_boost_oversamp_tuned, x_train_oversamp, y_train_oversamp
)
gradient_boost_oversamp_tuned_metrics_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.960973 | 0.968229 | 0.954379 | 0.961254 |
# Confusion matrix for training dataset
confusion_matrix(gradient_boost_oversamp_tuned, x_train_oversamp, y_train_oversamp)
# Calculate the model metrics for validation dataset
gradient_boost_oversamp_tuned_metrics_val = model_metrics(
gradient_boost_oversamp_tuned, x_val, y_val
)
gradient_boost_oversamp_tuned_metrics_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.941757 | 0.892638 | 0.778075 | 0.831429 |
# Confusion matrix for validation dataset
confusion_matrix(gradient_boost_oversamp_tuned, x_val, y_val)
%%time
# Build the model
ada_boost_undersamp_tuned = AdaBoostClassifier(random_state=1)
ada_boost_undersamp_tuned.fit(x_train_undersamp, y_train_undersamp)
Wall time: 174 ms
AdaBoostClassifier(random_state=1)
# Check available parameters for AdaBoost
ada_boost_undersamp_tuned.get_params().keys()
dict_keys(['algorithm', 'base_estimator', 'learning_rate', 'n_estimators', 'random_state'])
# Parameters for tuning:
parameters = {
# Let's try different max_depth for base_estimator
"base_estimator": [
DecisionTreeClassifier(max_depth=1),
DecisionTreeClassifier(max_depth=2),
DecisionTreeClassifier(max_depth=3),
],
"n_estimators": np.arange(10, 110, 10),
"learning_rate": np.arange(0.1, 2, 0.1),
}
# Type of scoring used to compare parameter combinations - We will choose Recall
recall_score = metrics.make_scorer(metrics.recall_score)
# Run the random search
# using n_iter = 30, so randomized search will try 30 different combinations of hyperparameters
# by default, n_iter = 10
ada_boost_undersamp_tuned_random_search = RandomizedSearchCV(
ada_boost_undersamp_tuned,
parameters,
n_iter=30,
scoring=recall_score,
cv=5,
random_state=1,
n_jobs=-1,
verbose=2,
)
ada_boost_undersamp_tuned_random_search = ada_boost_undersamp_tuned_random_search.fit(
x_train_undersamp, y_train_undersamp
)
Fitting 5 folds for each of 30 candidates, totalling 150 fits
# Extract the best estimator found by the random search
ada_boost_undersamp_tuned = ada_boost_undersamp_tuned_random_search.best_estimator_
ada_boost_undersamp_tuned
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3),
learning_rate=0.7000000000000001, random_state=1)
# Refit the best estimator (RandomizedSearchCV already refits it on the full
# training data by default, so this call is kept only for clarity)
ada_boost_undersamp_tuned.fit(x_train_undersamp, y_train_undersamp)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3),
learning_rate=0.7000000000000001, random_state=1)
# Calculate the model metrics for training dataset
ada_boost_undersamp_tuned_metrics_train = model_metrics(
ada_boost_undersamp_tuned, x_train_undersamp, y_train_undersamp
)
ada_boost_undersamp_tuned_metrics_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
# Confusion matrix for training dataset
confusion_matrix(ada_boost_undersamp_tuned, x_train_undersamp, y_train_undersamp)
# Calculate the model metrics for validation dataset
ada_boost_undersamp_tuned_metrics_val = model_metrics(
ada_boost_undersamp_tuned, x_val, y_val
)
ada_boost_undersamp_tuned_metrics_val
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.937315 | 0.96319 | 0.731935 | 0.831788 |
# Confusion matrix for validation dataset
confusion_matrix(ada_boost_undersamp_tuned, x_val, y_val)
training_tuned = pd.concat(
[
gradient_boost_tuned_metrics_train.T,
gradient_boost_oversamp_tuned_metrics_train.T,
ada_boost_undersamp_tuned_metrics_train.T
],
axis=1,
)
training_tuned.columns = [
"gradient_boost_tuned_metrics_train",
"gradient_boost_oversamp_tuned_metrics_train",
"ada_boost_undersamp_tuned_metrics_train"
]
training_tuned
| gradient_boost_tuned_metrics_train | gradient_boost_oversamp_tuned_metrics_train | ada_boost_undersamp_tuned_metrics_train | |
|---|---|---|---|
| Accuracy | 0.985844 | 0.960973 | 1.0 |
| Recall | 0.944672 | 0.968229 | 1.0 |
| Precision | 0.966457 | 0.954379 | 1.0 |
| F1 | 0.955440 | 0.961254 | 1.0 |
validation_tuned = pd.concat(
[
gradient_boost_tuned_metrics_val.T,
gradient_boost_oversamp_tuned_metrics_val.T,
ada_boost_undersamp_tuned_metrics_val.T,
],
axis=1,
)
validation_tuned.columns = [
"gradient_boost_tuned_metrics_val",
"gradient_boost_oversamp_tuned_metrics_val",
"ada_boost_undersamp_tuned_metrics_val",
]
validation_tuned
| gradient_boost_tuned_metrics_val | gradient_boost_oversamp_tuned_metrics_val | ada_boost_undersamp_tuned_metrics_val | |
|---|---|---|---|
| Accuracy | 0.972853 | 0.941757 | 0.937315 |
| Recall | 0.883436 | 0.892638 | 0.963190 |
| Precision | 0.944262 | 0.778075 | 0.731935 |
| F1 | 0.912837 | 0.831429 | 0.831788 |
Gradient Boost from default hyperparameters
Gradient Boost from oversampling
AdaBoost from undersampling
--------- Overall ---------
feature_names = x_train_oversamp.columns
importances = gradient_boost_oversamp_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
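Beyond the plot, the same importances can be ranked numerically to pick out the top drivers of churn. A sketch with hypothetical values standing in for the fitted model's `feature_importances_`:

```python
import numpy as np
import pandas as pd

# Hypothetical importances and names standing in for the fitted model above
feature_names = ["Total_Trans_Ct", "Total_Trans_Amt", "Total_Revolving_Bal", "Customer_Age"]
importances = np.array([0.45, 0.30, 0.20, 0.05])

# Rank features by importance and keep the top 3
top_features = (
    pd.Series(importances, index=feature_names)
    .sort_values(ascending=False)
    .head(3)
)
print(top_features)
```

A ranked series like this is often easier to drop into a report than the bar chart alone.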
# get columns that are numeric
numerical_features = data.select_dtypes(["float64", "int64"]).columns.tolist()
numerical_features
['Customer_Age', 'Dependent_count', 'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']
# creating a transformer for numerical variables, which will apply simple imputer on the numerical variables
# numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="median"))])
numeric_transformer = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
numeric_transformer
Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),
('standardscaler', StandardScaler())])
# get columns that are category
categorical_features = data.select_dtypes(["category"]).columns.tolist()
# remove target variable
categorical_features.remove("Attrition_Flag")
categorical_features
['Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category']
# creating a transformer for categorical variables, which will first apply simple imputer and
# then do one hot encoding for categorical variables
# handle_unknown = "ignore", allows model to handle any unknown category in the test data
categorical_transformer = Pipeline(
steps=[
("imputer", SimpleImputer(strategy="most_frequent")),
("onehot", OneHotEncoder(handle_unknown="ignore")),
]
)
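The effect of `handle_unknown="ignore"` is easiest to see on toy data: a category never seen during fitting is encoded as an all-zero row instead of raising an error (the values below are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

cat_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)
# The NaN is imputed with the mode ("Blue"); the encoder learns two categories
train = pd.DataFrame({"Card_Category": ["Blue", "Blue", "Silver", np.nan]})
cat_pipe.fit(train)
# "Platinum" was never seen in training: handle_unknown="ignore" encodes it as all zeros
unseen = pd.DataFrame({"Card_Category": ["Platinum"]})
print(cat_pipe.transform(unseen).toarray())  # [[0. 0.]]
```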
# combining the categorical and numerical transformers using a column transformer
# remainder="passthrough" lets any columns present in the original data but not listed in
# numerical_features or categorical_features pass through the column transformer unchanged
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numerical_features),
("cat", categorical_transformer, categorical_features),
],
remainder="passthrough",
)
preprocessor
ColumnTransformer(remainder='passthrough',
transformers=[('num',
Pipeline(steps=[('simpleimputer',
SimpleImputer(strategy='median')),
('standardscaler',
StandardScaler())]),
['Customer_Age', 'Dependent_count',
'Months_on_book', 'Total_Relationship_Count',
'Months_Inactive_12_mon',
'Contacts_Count_12_mon', 'Credit_Limit',
'Total_Revolving_Bal', 'Avg_Open_To_Buy',
'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1',
'Avg_Utilization_Ratio']),
('cat',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='most_frequent')),
('onehot',
OneHotEncoder(handle_unknown='ignore'))]),
['Gender', 'Education_Level', 'Marital_Status',
'Income_Category', 'Card_Category'])])
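A minimal sketch of how `remainder="passthrough"` behaves, using toy columns rather than the dataset: the column not named in any transformer is appended to the output untouched.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# "extra" is listed in no transformer, so passthrough appends its raw values
df = pd.DataFrame({"num": [1.0, 2.0, 3.0], "extra": [10, 20, 30]})
ct = ColumnTransformer(
    transformers=[("num", StandardScaler(), ["num"])],
    remainder="passthrough",
)
out = ct.fit_transform(df)
print(out)  # scaled "num" column, then the untouched "extra" column
```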
# Separating target variable and other variables
x = data.drop(["Attrition_Flag"], axis=1)
y = data["Attrition_Flag"].apply(lambda x: 1 if x == "Attrited Customer" else 0)
# Splitting the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size=0.30, random_state=1, stratify=y
)
print(x_train.shape, x_test.shape)
(7088, 19) (3039, 19)
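`stratify=y` keeps the churn rate identical in the train and test splits, which matters with an imbalanced target. A small sketch on a synthetic 20%-positive target (the data below is illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 20 positives out of 100
y_toy = pd.Series([1] * 20 + [0] * 80)
x_toy = pd.DataFrame({"f": range(100)})
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.30, random_state=1, stratify=y_toy
)
# Both splits preserve the 20% positive rate
print(y_tr.mean(), y_te.mean())  # 0.2 0.2
```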
# build model
pipeline_gradient_boost = GradientBoostingClassifier(
random_state=1,
subsample=0.8,
n_estimators=50,
max_depth=4,
learning_rate=0.05,
)
# Creating new pipeline with best parameters
pipeline_model = Pipeline(
steps=[("preprocessor", preprocessor), ("model", pipeline_gradient_boost)]
)
# Fit the model on training data
pipeline_model.fit(x_train, y_train)
Pipeline(steps=[('preprocessor',
ColumnTransformer(remainder='passthrough',
transformers=[('num',
Pipeline(steps=[('simpleimputer',
SimpleImputer(strategy='median')),
('standardscaler',
StandardScaler())]),
['Customer_Age',
'Dependent_count',
'Months_on_book',
'Total_Relationship_Count',
'Months_Inactive_12_mon',
'Contacts_Count_12_mon',
'Credit_Limit',
'Total_R...
'Avg_Utilization_Ratio']),
('cat',
Pipeline(steps=[('imputer',
SimpleImputer(strategy='most_frequent')),
('onehot',
OneHotEncoder(handle_unknown='ignore'))]),
['Gender', 'Education_Level',
'Marital_Status',
'Income_Category',
'Card_Category'])])),
('model',
GradientBoostingClassifier(learning_rate=0.05, max_depth=4,
n_estimators=50, random_state=1,
subsample=0.8))])
# Calculate the model metrics for training dataset
pipeline_model_metrics_train = model_metrics(pipeline_model, x_train, y_train)
pipeline_model_metrics_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.958663 | 0.784021 | 0.95 | 0.859067 |
# Confusion matrix for training dataset
confusion_matrix(pipeline_model, x_train, y_train)
# Calculate the model metrics for testing dataset
pipeline_model_metrics_test = model_metrics(pipeline_model, x_test, y_test)
pipeline_model_metrics_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.949984 | 0.75 | 0.924242 | 0.828054 |
# Confusion matrix for testing dataset
confusion_matrix(pipeline_model, x_test, y_test)
# transforming and predicting on test data
pipeline_model.predict(x_test)
array([0, 1, 0, ..., 0, 0, 0], dtype=int64)
About 16.07% of customers have churned, while 83.93% are still with the bank. This indicates that the relationship between the customers and the bank is quite good. However, the bank needs to make some adjustments to retain and attract more customers.
Customers with fewer transactions or a lower revolving balance carried over to the next month are more likely to churn. The bank can help customers build up more available balance on their credit cards, and could run promotions or offer lower interest rates to attract more customers.
The bank could also offer a cash bonus to customers who make more transactions or spend a certain amount of money in a month, which would help attract more customers.
As the analysis above shows, married customers tend to churn more than divorced customers. The bank should offer promotions to couples who sign up together, along with lower interest rates for them.
The more products a customer holds with the bank, the less likely they are to churn. Hence, the bank should create more attractive products that benefit customers in order to keep them.
From our analysis, customers earning over $120K are less likely to churn than those earning under $40K. As a strategy, the bank can offer lower interest rates to low-income customers and provide assistance and options for paying off their credit cards.